
addon: Add support for logging agent health checking #42

Merged: 19 commits from the health-check branch into rhobs:main on Jun 6, 2024

Conversation

@btaani (Contributor) commented Apr 24, 2024

With this PR, the addon agent will perform health checks on the spoke cluster's ClusterLogForwarder and OTEL collector instances:

  • For the ClusterLogForwarder resource, health is based on the status condition of type Ready being "True".
  • For the OTEL collector, if the .spec.replicas field is 0, the probe reports AVAILABLE = False to the addon (see the sketch below).
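As a rough sketch (not necessarily the exact code in this PR), the two rules above could be wired as addon-framework probe fields along these lines; the agent.HealthProber/workapiv1 shapes and the hard-coded names/namespaces are assumptions taken from the constants discussed later in this review:

```go
package addon

import (
	"fmt"

	"open-cluster-management.io/addon-framework/pkg/agent"
	workapiv1 "open-cluster-management.io/api/work/v1"
)

// healthProber sketches the two probe rules: the CLF Ready condition and the
// OTELCol replica count, surfaced via ManifestWork status feedback.
func healthProber() *agent.HealthProber {
	return &agent.HealthProber{
		Type: agent.HealthProberTypeWork,
		WorkProber: &agent.WorkHealthProber{
			ProbeFields: []agent.ProbeField{
				{
					// ClusterLogForwarder deployed on the spoke.
					ResourceIdentifier: workapiv1.ResourceIdentifier{
						Group:     "logging.openshift.io",
						Resource:  "clusterlogforwarders",
						Name:      "instance",
						Namespace: "openshift-logging",
					},
					ProbeRules: []workapiv1.FeedbackRule{{
						Type: workapiv1.JSONPathsType,
						JsonPaths: []workapiv1.JsonPath{{
							Name: "isReady",
							Path: `.status.conditions[?(@.type=="Ready")].status`,
						}},
					}},
				},
				{
					// OpenTelemetryCollector deployed on the spoke.
					ResourceIdentifier: workapiv1.ResourceIdentifier{
						Group:     "opentelemetry.io",
						Resource:  "opentelemetrycollectors",
						Name:      "spoke-otelcol",
						Namespace: "spoke-otelcol",
					},
					ProbeRules: []workapiv1.FeedbackRule{{
						Type: workapiv1.JSONPathsType,
						JsonPaths: []workapiv1.JsonPath{{
							Name: "replicas",
							Path: ".spec.replicas",
						}},
					}},
				},
			},
			// HealthCheck turns the feedback values into AVAILABLE=True/False:
			// the CLF is healthy when the Ready condition is "True", the
			// collector when .spec.replicas is greater than 0.
			HealthCheck: func(id workapiv1.ResourceIdentifier, result workapiv1.StatusFeedbackResult) error {
				for _, value := range result.Values {
					switch value.Name {
					case "isReady":
						if value.Value.String != nil && *value.Value.String == "True" {
							return nil
						}
					case "replicas":
						if value.Value.Integer != nil && *value.Value.Integer > 0 {
							return nil
						}
					}
				}
				return fmt.Errorf("%s %s/%s is not healthy yet", id.Resource, id.Namespace, id.Name)
			},
		},
	}
}
```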

Notes to reviewers:

  • The probe will not be able to report the status of the addon if one of the resources (CLF/OTELCol) is not deployed.
  • In the UI, if a probe is unsuccessful, the status of the addon will show as progressing.
  • Users can see the error message of a probe by looking at the status field of the ManagedClusterAddOn resource.
  • The status feedback keys/values can be checked by looking at the status of the ManifestWork.

Alternatives:

  • A simpler probe implementation is possible; it would only report whether the resources in the ManifestWork were created on the spoke cluster.

@coveralls commented Apr 24, 2024

Pull Request Test Coverage Report for Build 9404429060


  • 62 of 71 (87.32%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+2.6%) to 62.567%

Changes Missing Coverage:
  internal/addon/addon.go: 61 of 70 changed/added lines covered (87.14%)

Totals:
  Change from base Build 9404236923: +2.6%
  Covered Lines: 234
  Relevant Lines: 374

💛 - Coveralls

@JoaoBraveCoding (Collaborator) left a comment

General comment: MCOA currently supports having multiple signals enabled or disabled, so ideally the health prober should account for the fact that not all signals will always be enabled (if possible).

@btaani marked this pull request as ready for review on May 22, 2024 17:40
@btaani requested review from a team as code owners on May 22, 2024 17:40
@btaani force-pushed the health-check branch 2 times, most recently from 61c997a to f65db04 on May 22, 2024 17:46
Comment on lines 36 to 30
OtelcolName = "spoke-otelcol"
OtelcolNS = "spoke-otelcol"
Contributor
We need to get this information from the ManagedClusterAddOn object. The same applies to Logging.
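A minimal sketch of what that could look like, under the assumption that the install namespace on the ManagedClusterAddOn is the value that should replace the hardcoded namespace (the helper name is hypothetical, not code from this PR):

```go
package addon

import (
	addonapiv1alpha1 "open-cluster-management.io/api/addon/v1alpha1"
)

// spokeNamespace is a hypothetical helper: derive the namespace for the
// spoke resources from the ManagedClusterAddOn instead of a constant.
func spokeNamespace(mcAddon *addonapiv1alpha1.ManagedClusterAddOn) string {
	if ns := mcAddon.Spec.InstallNamespace; ns != "" {
		return ns
	}
	// Fallback used by the addon-framework when no namespace is set.
	return "open-cluster-management-agent-addon"
}
```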

Collaborator

The naming of these variables did not do a good job of conveying which Name/Namespace we are referring to here. These are the name & namespace of the resources that will be created on the spoke clusters. They don't change and are always the same unless we change the manifests. I've added tests so that if we do change them, the tests fail and we have to update these names.
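As a very rough sketch, the kind of guard test meant here could look like this; renderLoggingManifests is a hypothetical helper standing in for however the addon renders its logging manifests in tests, and the real tests in this PR may be structured differently:

```go
package addon

import (
	"testing"
)

func TestSpokeCLFNameMatchesManifests(t *testing.T) {
	// renderLoggingManifests is hypothetical: it renders the addon's logging
	// manifests and returns the ClusterLogForwarder that would be deployed.
	clf := renderLoggingManifests(t)

	// If the manifests are ever renamed, this fails and forces the probe
	// constants to be updated together with them.
	if clf.GetName() != SpokeCLFName || clf.GetNamespace() != SpokeCLFNamespace {
		t.Fatalf("probe constants out of sync with manifests: got %s/%s, want %s/%s",
			clf.GetNamespace(), clf.GetName(), SpokeCLFNamespace, SpokeCLFName)
	}
}
```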

Member

Hi @JoaoBraveCoding, it seems that the health checking implementation in this PR is static: if a user modifies the ClusterManagementAddOn, e.g. adds a new collector to a spoke or cluster set, the health of that collector is not reflected here. This use case was discussed as part of the proposal https://hackmd.io/aBUzPTEZRuCPp_kZqm4x2A?view

The addon should reflect the health of all collectors specified in the placements.

Collaborator

So the health checking is indeed static for the moment, because the resources MCOA currently generates are also static in their namespace/name (as of this PR, the CLF is always called "mcoa-instance" and the OTELCol is also called "mcoa-instance"; we can change to a different name, I'm not set on it). If the user changes the OTELCol reference on the ClusterManagementAddOn, there is no problem, because MCOA only cares about the Spec of the resource.

The hackmd does list a future requirement to deploy multiple instances of OTELCol (Deployment + DaemonSet). However, this is currently not possible from the POV of the addon-framework: it doesn't support 2 or more references of the same GVK. To change this, we would need to open a PR against the addon-framework to implement that functionality. The same goes for the health probes: currently they do not allow a dynamic list of ProbeFields/ResourceIdentifiers, which is something we will need to add support for in the future.

Member

It's an important requirement for us to support multiple collectors; I am not sure we can rely on getting this change into the addon-framework.

@iblancasa could you please chime in here and describe your idea of implementing this in a way similar to how secrets are handled by the MCOA?

Contributor

> @iblancasa could you please chime in here and describe your idea of implementing this in a way similar to how secrets are handled by the MCOA?

The problem we had was related to reconciliation, for instance. I was planning to follow the same approach we currently have for secrets and implement reconciliation and everything else if, for whatever reason, this could not be implemented as part of the addon-framework. But that will need further discussion, and I don't think this thread is the best place to settle on that solution.

Collaborator

@pavolloffay FWIW, the ACM ticket we know they are interested in for external contributions is ACM-11509. IMHO we will either way end up in a state where our addon supports multiple GVKs. However, we still have time until 2.12 to try contributing to the addon-framework first and then fall back to custom code if needed. WDYT?

@JoaoBraveCoding (Collaborator) left a comment

Good job! Looks promising 👀

@JoaoBraveCoding force-pushed the health-check branch 2 times, most recently from abfc92b to 26c9fb5 on May 31, 2024 10:28
@periklis (Collaborator) left a comment

LGTM. The main open question is how this works with custom placements in the ClusterManagementAddOn on the hub.

Comment on lines 22 to 30
ClusterLogForwardersResource = "clusterlogforwarders"
SpokeCLFName = "instance"
SpokeCLFNamespace = "openshift-logging"
clfProbeKey = "isReady"
clfProbePath = ".status.conditions[?(@.type==\"Ready\")].status"

OpenTelemetryCollectorsResource = "opentelemetrycollectors"
SpokeOTELColName = "spoke-otelcol"
SpokeOTELColNamespace = "spoke-otelcol"
otelColProbeKey = "replicas"
otelColProbePath = ".spec.replicas"
Collaborator

Why do we replicate the resource names here? Doesn't loggingv1 provide one for clusterlogforwarders? Same question for opentelemetrycollectors.

@JoaoBraveCoding @iblancasa Do we want to keep deploying instances of opentelemetrycollector on the spokes with the name spoke-otelcol? WDYT about referring to them with a prefix like mcoa-...? Same for clusterlogforwarders.

@JoaoBraveCoding Can the health prober work with CLFs/OtelCollectors named differently per clusterset? For example, one creates a CLF resource called mcoa-eu-clusters-instance for a custom placement in the ClusterManagementAddOn. How does probing work here?

Contributor

> @JoaoBraveCoding @iblancasa Do we want to keep deploying instances of opentelemetrycollector on the spokes with the name spoke-otelcol? WDYT about referring to them with a prefix like mcoa-...? Same for clusterlogforwarders.

Yes, it makes sense.
But, anyway, these values should be obtained from the AddonConfig, not hardcoded.

Collaborator

> Why do we replicate the resource names here? Doesn't loggingv1 provide one for clusterlogforwarders? Same question for opentelemetrycollectors.

AFAIK, unless I'm missing something, I didn't find a constant in either of their respective packages that gives me the plural of the resource.

> @JoaoBraveCoding @iblancasa Do we want to keep deploying instances of opentelemetrycollector on the spokes with the name spoke-otelcol? WDYT about referring to them with a prefix like mcoa-...? Same for clusterlogforwarders.

I'm fine with calling them both mcoa-instance or anything else. @iblancasa At least for now, I don't see a use case where we would need to make these names configurable via AddonConfig.

> @JoaoBraveCoding Can the health prober work with CLFs/OtelCollectors named differently per clusterset? For example, one creates a CLF resource called mcoa-eu-clusters-instance for a custom placement in the ClusterManagementAddOn. How does probing work here?

Yes, it can. In the end, once the CLFs/OtelCollectors are rendered into the ManifestWorks, they will all have the same name (nowadays the CLF is instance and the OtelCollector is spoke-otelcol), so unless we want to make these names customizable it will work. If, for any reason, we want these names to change depending on, for instance, the placement name, then this will not work and we will need improvements to the framework.

Contributor

> I'm fine with calling them both mcoa-instance or anything else. @iblancasa At least for now, I don't see a use case where we would need to make these names configurable via AddonConfig.

I would prefer to keep it coming from the AddonConfig since it's the "source of truth". Also, we will support, at least for the OpenTelemetryCollector, the creation of multiple instances, and those will be provided by users.

@JoaoBraveCoding force-pushed the health-check branch 2 times, most recently from 36a164a to ed25381 on June 3, 2024 16:27
@JoaoBraveCoding (Collaborator) left a comment

Holding as my latest changes introduced a bug

@JoaoBraveCoding (Collaborator) left a comment

Tested, issue fixed

@JoaoBraveCoding (Collaborator) commented

I'm merging this one since we have approval from the two teams.

@JoaoBraveCoding merged commit 5f4b5e2 into rhobs:main on Jun 6, 2024
3 checks passed